This project explores various clustering algorithms for segmenting information using helpful Python packages. The models applied to cluster the high-dimensional data were the K-Means, Affinity Propagation, Mean Shift, Spectral Clustering, and Agglomerative Clustering algorithms. Each algorithm was evaluated using the silhouette coefficient, which measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation), and therefore how well-separated the resulting clusters are. All results are consolidated in a Summary presented at the end of the document.
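As a minimal illustration of the evaluation metric (on synthetic blobs rather than the study data set), the silhouette coefficient for a fitted partition can be computed with scikit-learn:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated blobs stand in for the real data
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

# Fit K-Means and score the resulting partition
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)

# Values near +1 indicate dense, well-separated clusters;
# values near 0 indicate overlapping clusters
print(f'Silhouette coefficient: {score:.4f}')
```

The same call pattern applies to any of the estimators used in this study, since all of them produce a label vector that `silhouette_score` can evaluate against the feature matrix.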
Cluster analysis is an unsupervised learning method that identifies structural patterns in an unlabeled data set by segmenting the observations into clusters whose members share more characteristics with one another than with observations in other clusters. The algorithms applied in this study partition the data set using either hierarchical methods (agglomerative, where smaller clusters are merged into larger ones, or divisive, where larger clusters are split into smaller ones) or non-hierarchical methods (where each observation is placed in exactly one of a set of mutually exclusive clusters).
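The two families can be contrasted on a small hypothetical two-feature array (not the study data), using the scikit-learn estimators named above:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# Toy two-feature data with two obvious groups
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])

# Non-hierarchical (partitional): each point is assigned to exactly one of k clusters
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative): singleton clusters are merged bottom-up
# until the requested number of clusters remains
agglo_labels = AgglomerativeClustering(n_clusters=2, linkage='ward').fit_predict(X)

print(kmeans_labels, agglo_labels)
```

On such clearly separated data both approaches recover the same two groups; they diverge on data with nested or irregularly shaped structure.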
Datasets used for the analysis were separately gathered and consolidated from various sources including:
This study hypothesized that death rates across major cancer types contain inherent patterns and structures, enabling the grouping of similar countries and the differentiation of dissimilar ones.
Due to the unsupervised learning nature of the analysis, no target variable is defined for the study.
The clustering descriptor variables for the study are:
The target descriptor variables for the study are:
The metadata variables for the study are:
##################################
# Installing the geopandas package
##################################
# !pip install geopandas
##################################
# Setting the Python Environment
##################################
import os
os.environ["OMP_NUM_THREADS"] = '1'
##################################
# Loading Python Libraries
##################################
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import itertools
%matplotlib inline
from operator import add,mul,truediv
from sklearn.preprocessing import PowerTransformer, StandardScaler
from scipy import stats
from sklearn.cluster import KMeans, AffinityPropagation, MeanShift, SpectralClustering, AgglomerativeClustering, Birch, BisectingKMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import geopandas as gpd
##################################
# Setting Global Options
##################################
np.set_printoptions(suppress=True, precision=4)
pd.options.display.float_format = '{:.4f}'.format
##################################
# Loading the dataset
##################################
cancer_death_rate = pd.read_csv('CancerDeathsByCountryCode.csv')
##################################
# Performing a general exploration of the dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate.shape)
Dataset Dimensions:
(208, 16)
##################################
# Listing the column names and data types
##################################
print('Column Names and Data Types:')
display(cancer_death_rate.dtypes)
Column Names and Data Types:
COUNTRY     object
CODE        object
PROCAN     float64
BRECAN     float64
CERCAN     float64
STOCAN     float64
ESOCAN     float64
PANCAN     float64
LUNCAN     float64
COLCAN     float64
LIVCAN     float64
SMPREV     float64
OWPREV     float64
ACSHAR     float64
GEOLAT     float64
GEOLON     float64
dtype: object
##################################
# Taking a snapshot of the dataset
##################################
cancer_death_rate.head()
| | COUNTRY | CODE | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | SMPREV | OWPREV | ACSHAR | GEOLAT | GEOLON |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | 6.3700 | 8.6700 | 3.9000 | 29.3000 | 6.9600 | 2.7200 | 12.5300 | 8.4300 | 10.2700 | 11.9000 | 23.0000 | 0.2100 | 33.9391 | 67.7100 |
| 1 | Albania | ALB | 8.8700 | 6.5000 | 1.6400 | 10.6800 | 1.4400 | 6.6800 | 26.6300 | 9.1500 | 6.8400 | 20.5000 | 57.7000 | 7.1700 | 41.1533 | 20.1683 |
| 2 | Algeria | DZA | 5.3300 | 7.5800 | 2.1800 | 5.1000 | 1.1500 | 4.2700 | 10.4600 | 8.0500 | 2.2000 | 11.2000 | 62.0000 | 0.9500 | 28.0339 | 1.6596 |
| 3 | American Samoa | ASM | 20.9400 | 16.8100 | 5.0200 | 15.7900 | 1.5200 | 5.1900 | 28.0100 | 16.5500 | 7.0200 | NaN | NaN | NaN | -14.2710 | -170.1322 |
| 4 | Andorra | AND | 9.6800 | 9.0200 | 2.0400 | 8.3000 | 3.5600 | 10.2600 | 34.1800 | 22.9700 | 9.4400 | 26.6000 | 63.7000 | 11.0200 | 42.5462 | 1.6016 |
##################################
# Performing a general exploration of the numeric variables
##################################
if (len(cancer_death_rate.select_dtypes(include='number').columns)==0):
print('No numeric columns identified from the data.')
else:
print('Numeric Variable Summary:')
display(cancer_death_rate.describe(include='number').transpose())
Numeric Variable Summary:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| PROCAN | 208.0000 | 11.7260 | 7.6965 | 2.8100 | 6.5875 | 10.0050 | 13.9900 | 54.1500 |
| BRECAN | 208.0000 | 11.3350 | 4.3649 | 4.6900 | 8.3975 | 10.5600 | 13.0950 | 37.1000 |
| CERCAN | 208.0000 | 6.0651 | 5.1204 | 0.7100 | 1.8575 | 4.4800 | 9.0575 | 39.9500 |
| STOCAN | 208.0000 | 10.5975 | 5.8993 | 3.4000 | 6.6350 | 9.1550 | 13.6725 | 46.0400 |
| ESOCAN | 208.0000 | 4.8946 | 4.1320 | 0.9600 | 2.3350 | 3.3100 | 5.4150 | 25.7600 |
| PANCAN | 208.0000 | 6.6004 | 3.0552 | 1.6000 | 4.2300 | 6.1150 | 8.7450 | 19.2900 |
| LUNCAN | 208.0000 | 21.0217 | 11.4489 | 5.9500 | 11.3800 | 20.0200 | 27.5125 | 78.2300 |
| COLCAN | 208.0000 | 13.6945 | 5.5475 | 4.9400 | 9.2775 | 12.7950 | 17.1325 | 31.3800 |
| LIVCAN | 208.0000 | 5.9826 | 9.0501 | 0.6500 | 2.8400 | 3.8950 | 6.0750 | 115.2300 |
| SMPREV | 186.0000 | 17.0140 | 8.0416 | 3.3000 | 10.4250 | 16.4000 | 22.8500 | 41.1000 |
| OWPREV | 191.0000 | 48.9963 | 17.0164 | 18.3000 | 31.2500 | 55.0000 | 60.9000 | 88.5000 |
| ACSHAR | 187.0000 | 6.0013 | 4.1502 | 0.0030 | 2.2750 | 5.7000 | 9.2500 | 20.5000 |
| GEOLAT | 208.0000 | 19.0381 | 24.3776 | -40.9006 | 4.1377 | 17.3443 | 40.0876 | 71.7069 |
| GEOLON | 208.0000 | 16.2690 | 71.9576 | -175.1982 | -11.1506 | 19.4388 | 47.8118 | 179.4144 |
##################################
# Performing a general exploration of the object variable
##################################
if (len(cancer_death_rate.select_dtypes(include='object').columns)==0):
print('No object columns identified from the data.')
else:
print('Object Variable Summary:')
display(cancer_death_rate.describe(include='object').transpose())
Object Variable Summary:
| | count | unique | top | freq |
|---|---|---|---|---|
| COUNTRY | 208 | 208 | Afghanistan | 1 |
| CODE | 203 | 203 | AFG | 1 |
##################################
# Performing a general exploration of the categorical variables
##################################
if (len(cancer_death_rate.select_dtypes(include='category').columns)==0):
print('No categorical columns identified from the data.')
else:
print('Categorical Variable Summary:')
display(cancer_death_rate.describe(include='category').transpose())
No categorical columns identified from the data.
Data quality findings based on assessment are as follows:
##################################
# Counting the number of duplicated rows
##################################
cancer_death_rate.duplicated().sum()
0
##################################
# Gathering the data types for each column
##################################
data_type_list = list(cancer_death_rate.dtypes)
##################################
# Gathering the variable names for each column
##################################
variable_name_list = list(cancer_death_rate.columns)
##################################
# Gathering the number of observations for each column
##################################
row_count_list = list([len(cancer_death_rate)] * len(cancer_death_rate.columns))
##################################
# Gathering the number of missing data for each column
##################################
null_count_list = list(cancer_death_rate.isna().sum(axis=0))
##################################
# Gathering the number of non-missing data for each column
##################################
non_null_count_list = list(cancer_death_rate.count())
##################################
# Gathering the fill rate for each column
##################################
fill_rate_list = map(truediv, non_null_count_list, row_count_list)
##################################
# Formulating the summary
# for all columns
##################################
all_column_quality_summary = pd.DataFrame(zip(variable_name_list,
data_type_list,
row_count_list,
non_null_count_list,
null_count_list,
fill_rate_list),
columns=['Column.Name',
'Column.Type',
'Row.Count',
'Non.Null.Count',
'Null.Count',
'Fill.Rate'])
display(all_column_quality_summary)
| | Column.Name | Column.Type | Row.Count | Non.Null.Count | Null.Count | Fill.Rate |
|---|---|---|---|---|---|---|
| 0 | COUNTRY | object | 208 | 208 | 0 | 1.0000 |
| 1 | CODE | object | 208 | 203 | 5 | 0.9760 |
| 2 | PROCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 3 | BRECAN | float64 | 208 | 208 | 0 | 1.0000 |
| 4 | CERCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 5 | STOCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 6 | ESOCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 7 | PANCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 8 | LUNCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 9 | COLCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 10 | LIVCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 11 | SMPREV | float64 | 208 | 186 | 22 | 0.8942 |
| 12 | OWPREV | float64 | 208 | 191 | 17 | 0.9183 |
| 13 | ACSHAR | float64 | 208 | 187 | 21 | 0.8990 |
| 14 | GEOLAT | float64 | 208 | 208 | 0 | 1.0000 |
| 15 | GEOLON | float64 | 208 | 208 | 0 | 1.0000 |
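The fill rates tabulated above can also be obtained in a single vectorized pandas expression, shown here on a small hypothetical frame rather than the study data:

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with missing values
df = pd.DataFrame({'A': [1.0, np.nan, 3.0, 4.0],
                   'B': ['x', 'y', None, 'z']})

# Non-null count divided by row count gives the per-column fill rate
fill_rate = df.count() / len(df)
print(fill_rate)  # A: 0.75, B: 0.75
```

This avoids the intermediate lists and `map(truediv, ...)` pairing at the cost of keeping the result as a pandas Series rather than plain lists.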
##################################
# Counting the number of columns
# with Fill.Rate < 1.00
##################################
len(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)])
4
##################################
# Identifying the columns
# with Fill.Rate < 1.00
##################################
if (len(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)])==0):
print('No columns with Fill.Rate < 1.00.')
else:
display(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)].sort_values(by=['Fill.Rate'], ascending=True))
| | Column.Name | Column.Type | Row.Count | Non.Null.Count | Null.Count | Fill.Rate |
|---|---|---|---|---|---|---|
| 11 | SMPREV | float64 | 208 | 186 | 22 | 0.8942 |
| 13 | ACSHAR | float64 | 208 | 187 | 21 | 0.8990 |
| 12 | OWPREV | float64 | 208 | 191 | 17 | 0.9183 |
| 1 | CODE | object | 208 | 203 | 5 | 0.9760 |
##################################
# Identifying the columns
# with Fill.Rate < 1.00
##################################
column_low_fill_rate = all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1.00)]
##################################
# Gathering the metadata labels for each observation
##################################
row_metadata_list = cancer_death_rate["COUNTRY"].values.tolist()
##################################
# Gathering the number of columns for each observation
##################################
column_count_list = list([len(cancer_death_rate.columns)] * len(cancer_death_rate))
##################################
# Gathering the number of missing data for each row
##################################
null_row_list = list(cancer_death_rate.isna().sum(axis=1))
##################################
# Gathering the missing data percentage for each row
##################################
missing_rate_list = map(truediv, null_row_list, column_count_list)
##################################
# Identifying the rows
# with missing data
##################################
all_row_quality_summary = pd.DataFrame(zip(row_metadata_list,
column_count_list,
null_row_list,
missing_rate_list),
columns=['Row.Name',
'Column.Count',
'Null.Count',
'Missing.Rate'])
display(all_row_quality_summary)
| | Row.Name | Column.Count | Null.Count | Missing.Rate |
|---|---|---|---|---|
| 0 | Afghanistan | 16 | 0 | 0.0000 |
| 1 | Albania | 16 | 0 | 0.0000 |
| 2 | Algeria | 16 | 0 | 0.0000 |
| 3 | American Samoa | 16 | 3 | 0.1875 |
| 4 | Andorra | 16 | 0 | 0.0000 |
| ... | ... | ... | ... | ... |
| 203 | Vietnam | 16 | 0 | 0.0000 |
| 204 | Wales | 16 | 4 | 0.2500 |
| 205 | Yemen | 16 | 0 | 0.0000 |
| 206 | Zambia | 16 | 0 | 0.0000 |
| 207 | Zimbabwe | 16 | 0 | 0.0000 |
208 rows × 4 columns
##################################
# Counting the number of rows
# with Missing.Rate > 0.00
##################################
len(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)])
25
##################################
# Identifying the rows
# with Missing.Rate > 0.00
##################################
row_missing_rate = all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)]
##################################
# Identifying the rows
# with Missing.Rate > 0.00
##################################
if (len(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)])==0):
print('No rows with Missing.Rate > 0.00.')
else:
display(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)].sort_values(by=['Missing.Rate'], ascending=False))
| | Row.Name | Column.Count | Null.Count | Missing.Rate |
|---|---|---|---|---|
| 204 | Wales | 16 | 4 | 0.2500 |
| 135 | Northern Ireland | 16 | 4 | 0.2500 |
| 57 | England | 16 | 4 | 0.2500 |
| 186 | Tokelau | 16 | 4 | 0.2500 |
| 161 | Scotland | 16 | 4 | 0.2500 |
| 198 | United States Virgin Islands | 16 | 3 | 0.1875 |
| 173 | South Sudan | 16 | 3 | 0.1875 |
| 158 | San Marino | 16 | 3 | 0.1875 |
| 149 | Puerto Rico | 16 | 3 | 0.1875 |
| 20 | Bermuda | 16 | 3 | 0.1875 |
| 3 | American Samoa | 16 | 3 | 0.1875 |
| 118 | Monaco | 16 | 3 | 0.1875 |
| 74 | Guam | 16 | 3 | 0.1875 |
| 72 | Greenland | 16 | 3 | 0.1875 |
| 136 | Northern Mariana Islands | 16 | 3 | 0.1875 |
| 132 | Niue | 16 | 2 | 0.1250 |
| 140 | Palau | 16 | 2 | 0.1250 |
| 141 | Palestine | 16 | 2 | 0.1250 |
| 181 | Taiwan | 16 | 2 | 0.1250 |
| 41 | Cook Islands | 16 | 2 | 0.1250 |
| 125 | Nauru | 16 | 1 | 0.0625 |
| 154 | Saint Kitts and Nevis | 16 | 1 | 0.0625 |
| 116 | Micronesia | 16 | 1 | 0.0625 |
| 112 | Marshall Islands | 16 | 1 | 0.0625 |
| 192 | Tuvalu | 16 | 1 | 0.0625 |
##################################
# Formulating the dataset
# with numeric columns only
##################################
cancer_death_rate_numeric = cancer_death_rate.select_dtypes(include='number')
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = cancer_death_rate_numeric.columns
##################################
# Gathering the minimum value for each numeric column
##################################
numeric_minimum_list = cancer_death_rate_numeric.min()
##################################
# Gathering the mean value for each numeric column
##################################
numeric_mean_list = cancer_death_rate_numeric.mean()
##################################
# Gathering the median value for each numeric column
##################################
numeric_median_list = cancer_death_rate_numeric.median()
##################################
# Gathering the maximum value for each numeric column
##################################
numeric_maximum_list = cancer_death_rate_numeric.max()
##################################
# Gathering the first mode values for each numeric column
##################################
numeric_first_mode_list = [cancer_death_rate[x].value_counts(dropna=True).index.tolist()[0] for x in cancer_death_rate_numeric]
##################################
# Gathering the second mode values for each numeric column
##################################
numeric_second_mode_list = [cancer_death_rate[x].value_counts(dropna=True).index.tolist()[1] for x in cancer_death_rate_numeric]
##################################
# Gathering the count of first mode values for each numeric column
##################################
numeric_first_mode_count_list = [cancer_death_rate_numeric[x].isin([cancer_death_rate[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in cancer_death_rate_numeric]
##################################
# Gathering the count of second mode values for each numeric column
##################################
numeric_second_mode_count_list = [cancer_death_rate_numeric[x].isin([cancer_death_rate[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in cancer_death_rate_numeric]
##################################
# Gathering the first mode to second mode ratio for each numeric column
##################################
numeric_first_second_mode_ratio_list = map(truediv, numeric_first_mode_count_list, numeric_second_mode_count_list)
##################################
# Gathering the count of unique values for each numeric column
##################################
numeric_unique_count_list = cancer_death_rate_numeric.nunique(dropna=True)
##################################
# Gathering the number of observations for each numeric column
##################################
numeric_row_count_list = list([len(cancer_death_rate_numeric)] * len(cancer_death_rate_numeric.columns))
##################################
# Gathering the unique to count ratio for each numeric column
##################################
numeric_unique_count_ratio_list = map(truediv, numeric_unique_count_list, numeric_row_count_list)
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = cancer_death_rate_numeric.skew()
##################################
# Gathering the kurtosis value for each numeric column
##################################
numeric_kurtosis_list = cancer_death_rate_numeric.kurtosis()
numeric_column_quality_summary = pd.DataFrame(zip(numeric_variable_name_list,
numeric_minimum_list,
numeric_mean_list,
numeric_median_list,
numeric_maximum_list,
numeric_first_mode_list,
numeric_second_mode_list,
numeric_first_mode_count_list,
numeric_second_mode_count_list,
numeric_first_second_mode_ratio_list,
numeric_unique_count_list,
numeric_row_count_list,
numeric_unique_count_ratio_list,
numeric_skewness_list,
numeric_kurtosis_list),
columns=['Numeric.Column.Name',
'Minimum',
'Mean',
'Median',
'Maximum',
'First.Mode',
'Second.Mode',
'First.Mode.Count',
'Second.Mode.Count',
'First.Second.Mode.Ratio',
'Unique.Count',
'Row.Count',
'Unique.Count.Ratio',
'Skewness',
'Kurtosis'])
if (len(cancer_death_rate_numeric.columns)==0):
print('No numeric columns identified from the data.')
else:
display(numeric_column_quality_summary)
| | Numeric.Column.Name | Minimum | Mean | Median | Maximum | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PROCAN | 2.8100 | 11.7260 | 10.0050 | 54.1500 | 15.4100 | 9.2300 | 2 | 2 | 1.0000 | 198 | 208 | 0.9519 | 2.1250 | 6.1837 |
| 1 | BRECAN | 4.6900 | 11.3350 | 10.5600 | 37.1000 | 10.2900 | 8.9900 | 3 | 2 | 1.5000 | 190 | 208 | 0.9135 | 1.5844 | 5.4634 |
| 2 | CERCAN | 0.7100 | 6.0651 | 4.4800 | 39.9500 | 4.6200 | 1.5200 | 3 | 3 | 1.0000 | 189 | 208 | 0.9087 | 1.9715 | 8.3399 |
| 3 | STOCAN | 3.4000 | 10.5975 | 9.1550 | 46.0400 | 7.0200 | 6.5800 | 2 | 2 | 1.0000 | 196 | 208 | 0.9423 | 2.0526 | 7.3909 |
| 4 | ESOCAN | 0.9600 | 4.8946 | 3.3100 | 25.7600 | 2.5200 | 1.6800 | 3 | 3 | 1.0000 | 180 | 208 | 0.8654 | 2.0659 | 5.2990 |
| 5 | PANCAN | 1.6000 | 6.6004 | 6.1150 | 19.2900 | 3.1300 | 3.0700 | 3 | 2 | 1.5000 | 187 | 208 | 0.8990 | 0.9127 | 1.5264 |
| 6 | LUNCAN | 5.9500 | 21.0217 | 20.0200 | 78.2300 | 10.7500 | 11.6200 | 3 | 2 | 1.5000 | 200 | 208 | 0.9615 | 1.2646 | 2.8631 |
| 7 | COLCAN | 4.9400 | 13.6945 | 12.7950 | 31.3800 | 10.9000 | 12.2900 | 2 | 2 | 1.0000 | 199 | 208 | 0.9567 | 0.7739 | 0.1459 |
| 8 | LIVCAN | 0.6500 | 5.9826 | 3.8950 | 115.2300 | 2.7500 | 2.7400 | 6 | 4 | 1.5000 | 173 | 208 | 0.8317 | 9.1131 | 104.2327 |
| 9 | SMPREV | 3.3000 | 17.0140 | 16.4000 | 41.1000 | 22.4000 | 26.5000 | 4 | 4 | 1.0000 | 141 | 208 | 0.6779 | 0.4096 | -0.4815 |
| 10 | OWPREV | 18.3000 | 48.9963 | 55.0000 | 88.5000 | 61.6000 | 28.4000 | 5 | 3 | 1.6667 | 157 | 208 | 0.7548 | -0.1617 | -0.9762 |
| 11 | ACSHAR | 0.0030 | 6.0013 | 5.7000 | 20.5000 | 0.6900 | 12.0300 | 3 | 2 | 1.5000 | 177 | 208 | 0.8510 | 0.3532 | -0.5657 |
| 12 | GEOLAT | -40.9006 | 19.0381 | 17.3443 | 71.7069 | 55.3781 | 53.4129 | 2 | 2 | 1.0000 | 206 | 208 | 0.9904 | -0.1861 | -0.6520 |
| 13 | GEOLON | -175.1982 | 16.2690 | 19.4388 | 179.4144 | -3.4360 | -8.2439 | 2 | 2 | 1.0000 | 206 | 208 | 0.9904 | -0.2025 | 0.3981 |
##################################
# Counting the number of numeric columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>5)])
0
##################################
# Identifying the numeric columns
# with First.Second.Mode.Ratio > 5.00
##################################
if (len(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>5)])==0):
print('No numeric columns with First.Second.Mode.Ratio > 5.00.')
else:
display(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>5)].sort_values(by=['First.Second.Mode.Ratio'], ascending=False))
No numeric columns with First.Second.Mode.Ratio > 5.00.
##################################
# Counting the number of numeric columns
# with Unique.Count.Ratio > 10.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Unique.Count.Ratio']>10)])
0
##################################
# Counting the number of numeric columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3)|(numeric_column_quality_summary['Skewness']<(-3))])
1
##################################
# Identifying the numeric columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
if (len(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3) | (numeric_column_quality_summary['Skewness']<(-3))])==0):
print('No numeric columns with Skewness > 3.00 or Skewness < -3.00.')
else:
display(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3) | (numeric_column_quality_summary['Skewness']<(-3))].sort_values(by=['Skewness'], ascending=False))
| | Numeric.Column.Name | Minimum | Mean | Median | Maximum | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | LIVCAN | 0.6500 | 5.9826 | 3.8950 | 115.2300 | 2.7500 | 2.7400 | 6 | 4 | 1.5000 | 173 | 208 | 0.8317 | 9.1131 | 104.2327 |
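Given the heavy right skew of LIVCAN (skewness ≈ 9.11), a Yeo-Johnson power transformation (via the `PowerTransformer` already imported above) can pull such a distribution toward symmetry. A sketch on a synthetic right-skewed sample, standing in for the actual column:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed sample (log-normal), standing in for LIVCAN
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=1.0, sigma=1.0, size=500))

# Yeo-Johnson also handles zero and negative values, unlike Box-Cox
transformer = PowerTransformer(method='yeo-johnson')
transformed = pd.Series(transformer.fit_transform(skewed.to_frame()).ravel())

print(f'Skewness before: {skewed.skew():.2f}, after: {transformed.skew():.2f}')
```

Applying such a transformation before distance-based clustering keeps extreme values in a single variable from dominating the distance computations.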
##################################
# Formulating the dataset
# with object column only
##################################
cancer_death_rate_object = cancer_death_rate.select_dtypes(include='object')
##################################
# Gathering the variable names for the object column
##################################
object_variable_name_list = cancer_death_rate_object.columns
##################################
# Gathering the first mode values for the object column
##################################
object_first_mode_list = [cancer_death_rate[x].value_counts().index.tolist()[0] for x in cancer_death_rate_object]
##################################
# Gathering the second mode values for each object column
##################################
object_second_mode_list = [cancer_death_rate[x].value_counts().index.tolist()[1] for x in cancer_death_rate_object]
##################################
# Gathering the count of first mode values for each object column
##################################
object_first_mode_count_list = [cancer_death_rate_object[x].isin([cancer_death_rate[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in cancer_death_rate_object]
##################################
# Gathering the count of second mode values for each object column
##################################
object_second_mode_count_list = [cancer_death_rate_object[x].isin([cancer_death_rate[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in cancer_death_rate_object]
##################################
# Gathering the first mode to second mode ratio for each object column
##################################
object_first_second_mode_ratio_list = map(truediv, object_first_mode_count_list, object_second_mode_count_list)
##################################
# Gathering the count of unique values for each object column
##################################
object_unique_count_list = cancer_death_rate_object.nunique(dropna=True)
##################################
# Gathering the number of observations for each object column
##################################
object_row_count_list = list([len(cancer_death_rate_object)] * len(cancer_death_rate_object.columns))
##################################
# Gathering the unique to count ratio for each object column
##################################
object_unique_count_ratio_list = map(truediv, object_unique_count_list, object_row_count_list)
object_column_quality_summary = pd.DataFrame(zip(object_variable_name_list,
object_first_mode_list,
object_second_mode_list,
object_first_mode_count_list,
object_second_mode_count_list,
object_first_second_mode_ratio_list,
object_unique_count_list,
object_row_count_list,
object_unique_count_ratio_list),
columns=['Object.Column.Name',
'First.Mode',
'Second.Mode',
'First.Mode.Count',
'Second.Mode.Count',
'First.Second.Mode.Ratio',
'Unique.Count',
'Row.Count',
'Unique.Count.Ratio'])
if (len(cancer_death_rate_object.columns)==0):
print('No object columns identified from the data.')
else:
display(object_column_quality_summary)
| | Object.Column.Name | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio |
|---|---|---|---|---|---|---|---|---|---|
| 0 | COUNTRY | Afghanistan | Albania | 1 | 1 | 1.0000 | 208 | 208 | 1.0000 |
| 1 | CODE | AFG | PSX | 1 | 1 | 1.0000 | 203 | 208 | 0.9760 |
##################################
# Counting the number of object columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(object_column_quality_summary[(object_column_quality_summary['First.Second.Mode.Ratio']>5)])
0
##################################
# Counting the number of object columns
# with Unique.Count.Ratio > 10.00
##################################
len(object_column_quality_summary[(object_column_quality_summary['Unique.Count.Ratio']>10)])
0
##################################
# Formulating the dataset
# with categorical columns only
##################################
cancer_death_rate_categorical = cancer_death_rate.select_dtypes(include='category')
##################################
# Gathering the variable names for the categorical column
##################################
categorical_variable_name_list = cancer_death_rate_categorical.columns
##################################
# Gathering the first mode values for each categorical column
##################################
categorical_first_mode_list = [cancer_death_rate[x].value_counts().index.tolist()[0] for x in cancer_death_rate_categorical]
##################################
# Gathering the second mode values for each categorical column
##################################
categorical_second_mode_list = [cancer_death_rate[x].value_counts().index.tolist()[1] for x in cancer_death_rate_categorical]
##################################
# Gathering the count of first mode values for each categorical column
##################################
categorical_first_mode_count_list = [cancer_death_rate_categorical[x].isin([cancer_death_rate[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in cancer_death_rate_categorical]
##################################
# Gathering the count of second mode values for each categorical column
##################################
categorical_second_mode_count_list = [cancer_death_rate_categorical[x].isin([cancer_death_rate[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in cancer_death_rate_categorical]
##################################
# Gathering the first mode to second mode ratio for each categorical column
##################################
categorical_first_second_mode_ratio_list = map(truediv, categorical_first_mode_count_list, categorical_second_mode_count_list)
##################################
# Gathering the count of unique values for each categorical column
##################################
categorical_unique_count_list = cancer_death_rate_categorical.nunique(dropna=True)
##################################
# Gathering the number of observations for each categorical column
##################################
categorical_row_count_list = list([len(cancer_death_rate_categorical)] * len(cancer_death_rate_categorical.columns))
##################################
# Gathering the unique to count ratio for each categorical column
##################################
categorical_unique_count_ratio_list = map(truediv, categorical_unique_count_list, categorical_row_count_list)
categorical_column_quality_summary = pd.DataFrame(zip(categorical_variable_name_list,
categorical_first_mode_list,
categorical_second_mode_list,
categorical_first_mode_count_list,
categorical_second_mode_count_list,
categorical_first_second_mode_ratio_list,
categorical_unique_count_list,
categorical_row_count_list,
categorical_unique_count_ratio_list),
columns=['Categorical.Column.Name',
'First.Mode',
'Second.Mode',
'First.Mode.Count',
'Second.Mode.Count',
'First.Second.Mode.Ratio',
'Unique.Count',
'Row.Count',
'Unique.Count.Ratio'])
if (len(cancer_death_rate_categorical.columns)==0):
print('No categorical columns identified from the data.')
else:
display(categorical_column_quality_summary)
No categorical columns identified from the data.
##################################
# Counting the number of categorical columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['First.Second.Mode.Ratio']>5)])
0
##################################
# Counting the number of categorical columns
# with Unique.Count.Ratio > 10.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['Unique.Count.Ratio']>10)])
0
##################################
# Performing a general exploration of the original dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate.shape)
Dataset Dimensions:
(208, 16)
##################################
# Filtering out the rows with
# with Missing.Rate > 0.00
##################################
cancer_death_rate_filtered_row = cancer_death_rate.drop(cancer_death_rate[cancer_death_rate.COUNTRY.isin(row_missing_rate['Row.Name'].values.tolist())].index)
##################################
# Performing a general exploration of the filtered dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate_filtered_row.shape)
Dataset Dimensions:
(183, 16)
##################################
# Re-evaluating the missing data summary
# for the filtered data
##################################
variable_name_list = list(cancer_death_rate_filtered_row.columns)
null_count_list = list(cancer_death_rate_filtered_row.isna().sum(axis=0))
all_column_quality_summary = pd.DataFrame(zip(variable_name_list,
null_count_list),
columns=['Column.Name',
'Null.Count'])
display(all_column_quality_summary)
| | Column.Name | Null.Count |
|---|---|---|
| 0 | COUNTRY | 0 |
| 1 | CODE | 0 |
| 2 | PROCAN | 0 |
| 3 | BRECAN | 0 |
| 4 | CERCAN | 0 |
| 5 | STOCAN | 0 |
| 6 | ESOCAN | 0 |
| 7 | PANCAN | 0 |
| 8 | LUNCAN | 0 |
| 9 | COLCAN | 0 |
| 10 | LIVCAN | 0 |
| 11 | SMPREV | 0 |
| 12 | OWPREV | 0 |
| 13 | ACSHAR | 0 |
| 14 | GEOLAT | 0 |
| 15 | GEOLON | 0 |
##################################
# Counting the number of columns
# with Null.Count > 0.00
##################################
len(all_column_quality_summary[(all_column_quality_summary['Null.Count']>0)])
0
##################################
# Formulating a new dataset object
# for the cleaned data
##################################
cancer_death_rate_cleaned = cancer_death_rate_filtered_row
cancer_death_rate_cleaned.reset_index(drop=True,inplace=True)
##################################
# Performing a general exploration of the filtered dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate_cleaned.shape)
Dataset Dimensions:
(183, 16)
##################################
# Formulating the cleaned dataset
# with geolocation data
##################################
cancer_death_rate_cleaned_numeric = cancer_death_rate_cleaned.select_dtypes(include='number').copy()
cancer_death_rate_cleaned_numeric_geolocation = cancer_death_rate_cleaned_numeric[['GEOLAT','GEOLON']]
##################################
# Formulating the cleaned dataset
# with numeric columns only
# without the geolocation data
##################################
cancer_death_rate_cleaned_numeric.drop(['GEOLAT','GEOLON'], inplace=True, axis=1)
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = list(cancer_death_rate_cleaned_numeric.columns)
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = cancer_death_rate_cleaned_numeric.skew()
##################################
# Computing the interquartile range
# for all columns
##################################
cancer_death_rate_cleaned_numeric_q1 = cancer_death_rate_cleaned_numeric.quantile(0.25)
cancer_death_rate_cleaned_numeric_q3 = cancer_death_rate_cleaned_numeric.quantile(0.75)
cancer_death_rate_cleaned_numeric_iqr = cancer_death_rate_cleaned_numeric_q3 - cancer_death_rate_cleaned_numeric_q1
##################################
# Gathering the outlier count for each numeric column
# based on the interquartile range criterion
##################################
numeric_outlier_count_list = ((cancer_death_rate_cleaned_numeric < (cancer_death_rate_cleaned_numeric_q1 - 1.5 * cancer_death_rate_cleaned_numeric_iqr)) | (cancer_death_rate_cleaned_numeric > (cancer_death_rate_cleaned_numeric_q3 + 1.5 * cancer_death_rate_cleaned_numeric_iqr))).sum()
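The 1.5 × IQR fence applied above can be sanity-checked on a toy Series (a minimal sketch with made-up values, not the study data):

```python
import pandas as pd

# Toy column with one obvious high outlier
toy = pd.Series([10, 12, 11, 13, 12, 11, 14, 100], name='TOYCAN')

q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
is_outlier = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)
print(int(is_outlier.sum()))  # → 1 (only the value 100 is flagged)
```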
##################################
# Gathering the number of observations for each column
##################################
numeric_row_count_list = list([len(cancer_death_rate_cleaned_numeric)] * len(cancer_death_rate_cleaned_numeric.columns))
##################################
# Gathering the outlier to count ratio for each numeric column
##################################
numeric_outlier_ratio_list = map(truediv, numeric_outlier_count_list, numeric_row_count_list)
##################################
# Formulating the outlier summary
# for all numeric columns
##################################
numeric_column_outlier_summary = pd.DataFrame(zip(numeric_variable_name_list,
numeric_skewness_list,
numeric_outlier_count_list,
numeric_row_count_list,
numeric_outlier_ratio_list),
columns=['Numeric.Column.Name',
'Skewness',
'Outlier.Count',
'Row.Count',
'Outlier.Ratio'])
display(numeric_column_outlier_summary)
| | Numeric.Column.Name | Skewness | Outlier.Count | Row.Count | Outlier.Ratio |
|---|---|---|---|---|---|
| 0 | PROCAN | 2.2461 | 11 | 183 | 0.0601 |
| 1 | BRECAN | 1.9575 | 8 | 183 | 0.0437 |
| 2 | CERCAN | 1.9896 | 2 | 183 | 0.0109 |
| 3 | STOCAN | 2.0858 | 6 | 183 | 0.0328 |
| 4 | ESOCAN | 2.0918 | 24 | 183 | 0.1311 |
| 5 | PANCAN | 0.5992 | 1 | 183 | 0.0055 |
| 6 | LUNCAN | 0.8574 | 2 | 183 | 0.0109 |
| 7 | COLCAN | 0.8201 | 2 | 183 | 0.0109 |
| 8 | LIVCAN | 8.7158 | 19 | 183 | 0.1038 |
| 9 | SMPREV | 0.4165 | 0 | 183 | 0.0000 |
| 10 | OWPREV | -0.3341 | 0 | 183 | 0.0000 |
| 11 | ACSHAR | 0.3372 | 1 | 183 | 0.0055 |
##################################
# Formulating the individual boxplots
# for all numeric columns
##################################
for column in cancer_death_rate_cleaned_numeric:
plt.figure(figsize=(17,1))
sns.boxplot(data=cancer_death_rate_cleaned_numeric, x=column)
##################################
# Formulating a function
# to plot the correlation matrix
# for all pairwise combinations
# of numeric columns
##################################
def plot_correlation_matrix(corr, mask=None):
f, ax = plt.subplots(figsize=(11, 9))
sns.heatmap(corr,
ax=ax,
mask=mask,
annot=True,
vmin=-1,
vmax=1,
center=0,
cmap='coolwarm',
linewidths=1,
linecolor='gray',
cbar_kws={'orientation': 'horizontal'})
##################################
# Computing the correlation coefficients
# and correlation p-values
# among pairs of numeric columns
##################################
cancer_death_rate_cleaned_numeric_correlation_pairs = {}
cancer_death_rate_cleaned_numeric_columns = cancer_death_rate_cleaned_numeric.columns.tolist()
for numeric_column_a, numeric_column_b in itertools.combinations(cancer_death_rate_cleaned_numeric_columns, 2):
cancer_death_rate_cleaned_numeric_correlation_pairs[numeric_column_a + '_' + numeric_column_b] = stats.pearsonr(
cancer_death_rate_cleaned_numeric.loc[:, numeric_column_a],
cancer_death_rate_cleaned_numeric.loc[:, numeric_column_b])
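The pairwise Pearson loop above follows this pattern, which can be checked on a tiny frame with known correlations (the column names here are invented for illustration):

```python
import itertools
import pandas as pd
from scipy import stats

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                   'B': [2.0, 4.0, 6.0, 8.0],   # B = 2*A, perfectly correlated
                   'C': [4.0, 3.0, 2.0, 1.0]})  # C = 5-A, perfectly anti-correlated

pairs = {}
for a, b in itertools.combinations(df.columns, 2):
    r, p = stats.pearsonr(df[a], df[b])
    pairs[f'{a}_{b}'] = round(r, 4)

print(pairs)  # → {'A_B': 1.0, 'A_C': -1.0, 'B_C': -1.0}
```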
##################################
# Formulating the pairwise correlation summary
# for all numeric columns
##################################
cancer_death_rate_cleaned_numeric_summary = pd.DataFrame.from_dict(cancer_death_rate_cleaned_numeric_correlation_pairs, orient='index')
cancer_death_rate_cleaned_numeric_summary.columns = ['Pearson.Correlation.Coefficient', 'Correlation.PValue']
display(cancer_death_rate_cleaned_numeric_summary.sort_values(by=['Pearson.Correlation.Coefficient'], ascending=False).head(20))
| | Pearson.Correlation.Coefficient | Correlation.PValue |
|---|---|---|
| PANCAN_COLCAN | 0.7537 | 0.0000 |
| LUNCAN_COLCAN | 0.7010 | 0.0000 |
| LUNCAN_SMPREV | 0.6415 | 0.0000 |
| PANCAN_LUNCAN | 0.6367 | 0.0000 |
| COLCAN_ACSHAR | 0.5819 | 0.0000 |
| PANCAN_ACSHAR | 0.5750 | 0.0000 |
| PANCAN_OWPREV | 0.5212 | 0.0000 |
| CERCAN_ESOCAN | 0.4803 | 0.0000 |
| LUNCAN_ACSHAR | 0.4330 | 0.0000 |
| STOCAN_LIVCAN | 0.4291 | 0.0000 |
| SMPREV_OWPREV | 0.4164 | 0.0000 |
| COLCAN_SMPREV | 0.4126 | 0.0000 |
| COLCAN_OWPREV | 0.4102 | 0.0000 |
| LUNCAN_OWPREV | 0.4087 | 0.0000 |
| PROCAN_BRECAN | 0.4081 | 0.0000 |
| PROCAN_CERCAN | 0.3650 | 0.0000 |
| PANCAN_SMPREV | 0.3603 | 0.0000 |
| BRECAN_CERCAN | 0.3589 | 0.0000 |
| ESOCAN_LIVCAN | 0.3009 | 0.0000 |
| CERCAN_STOCAN | 0.2790 | 0.0001 |
##################################
# Plotting the correlation matrix
# for all pairwise combinations
# of numeric columns
##################################
cancer_death_rate_cleaned_numeric_correlation = cancer_death_rate_cleaned_numeric.corr()
mask = np.triu(cancer_death_rate_cleaned_numeric_correlation)
plot_correlation_matrix(cancer_death_rate_cleaned_numeric_correlation,mask)
plt.show()
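Note that np.triu keeps the upper triangle including the diagonal, and any non-zero entry acts as True in the seaborn mask, so only the lower triangle of the heatmap remains visible; a quick sketch:

```python
import numpy as np

corr = np.array([[1.0, 0.8, 0.3],
                 [0.8, 1.0, 0.5],
                 [0.3, 0.5, 1.0]])

# Upper triangle (and diagonal) retained; lower triangle zeroed
mask = np.triu(corr)
print(mask[1, 0], mask[0, 1])  # → 0.0 0.8
```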
##################################
# Formulating a function
# to plot the correlation matrix
# for all pairwise combinations
# of numeric columns
# with significant p-values only
##################################
def correlation_significance(df=None):
p_matrix = np.zeros(shape=(df.shape[1],df.shape[1]))
for col in df.columns:
for col2 in df.drop(col,axis=1).columns:
_ , p = stats.pearsonr(df[col],df[col2])
p_matrix[df.columns.to_list().index(col),df.columns.to_list().index(col2)] = p
return p_matrix
##################################
# Plotting the correlation matrix
# for all pairwise combinations
# of numeric columns
# with significant p-values only
##################################
cancer_death_rate_cleaned_numeric_correlation_p_values = correlation_significance(cancer_death_rate_cleaned_numeric)
mask = np.invert(np.tril(cancer_death_rate_cleaned_numeric_correlation_p_values<0.05))
plot_correlation_matrix(cancer_death_rate_cleaned_numeric_correlation,mask)
##################################
# Performing a general exploration of the filtered dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate_cleaned_numeric.shape)
Dataset Dimensions:
(183, 12)
##################################
# Conducting a Yeo-Johnson Transformation
# to address the distributional
# shape of the variables
##################################
yeo_johnson_transformer = PowerTransformer(method='yeo-johnson',
standardize=False)
cancer_death_rate_cleaned_numeric_array = yeo_johnson_transformer.fit_transform(cancer_death_rate_cleaned_numeric)
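The distributional effect of the Yeo-Johnson transformation can be sketched on synthetic right-skewed data (not the study data); the skewness should shrink toward zero after the transformation:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(88888888)
skewed = pd.DataFrame({'X': rng.exponential(scale=2.0, size=500)})

pt = PowerTransformer(method='yeo-johnson', standardize=False)
transformed = pd.DataFrame(pt.fit_transform(skewed), columns=['X'])

# Skewness is pulled toward zero by the transformation
print(round(skewed['X'].skew(), 2), round(transformed['X'].skew(), 2))
```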
##################################
# Formulating a new dataset object
# for the transformed data
##################################
cancer_death_rate_transformed_numeric = pd.DataFrame(cancer_death_rate_cleaned_numeric_array,
columns=cancer_death_rate_cleaned_numeric.columns)
##################################
# Formulating the individual boxplots
# for all transformed numeric columns
##################################
for column in cancer_death_rate_transformed_numeric:
plt.figure(figsize=(17,1))
sns.boxplot(data=cancer_death_rate_transformed_numeric, x=column)
##################################
# Performing a general exploration of the filtered dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate_transformed_numeric.shape)
Dataset Dimensions:
(183, 12)
cancer_death_rate_transformed_numeric
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | SMPREV | OWPREV | ACSHAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.5595 | 1.5203 | 1.4836 | 2.1487 | 1.1939 | 1.5456 | 2.4417 | 1.9846 | 1.1035 | 4.7108 | 46.3969 | 0.2004 |
| 1 | 1.7272 | 1.4076 | 0.9307 | 1.7470 | 0.6928 | 2.6330 | 3.0570 | 2.0417 | 1.0378 | 6.4628 | 149.2448 | 3.8224 |
| 2 | 1.4670 | 1.4686 | 1.1002 | 1.4007 | 0.6154 | 2.0444 | 2.2954 | 1.9526 | 0.7708 | 4.5421 | 163.6112 | 0.7992 |
| 3 | 1.7704 | 1.5352 | 1.0595 | 1.6330 | 1.0009 | 3.2883 | 3.2603 | 2.6740 | 1.0911 | 7.4729 | 169.3720 | 5.1050 |
| 4 | 1.9120 | 1.6229 | 2.2285 | 1.6682 | 1.2514 | 1.7808 | 2.5816 | 2.0787 | 0.8161 | 3.8904 | 58.1402 | 3.7374 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 178 | 1.9905 | 1.5533 | 1.9988 | 1.8183 | 0.8160 | 2.4509 | 2.8083 | 2.1890 | 0.7997 | 5.7294 | 168.3522 | 2.5885 |
| 179 | 1.2905 | 1.6500 | 1.6597 | 1.7169 | 0.9307 | 2.1167 | 3.0677 | 2.4898 | 0.8322 | 6.4983 | 34.8080 | 4.3473 |
| 180 | 1.5002 | 1.4362 | 1.0565 | 2.0113 | 1.0697 | 1.3550 | 2.3069 | 1.8210 | 0.8885 | 5.2531 | 120.5007 | 0.0504 |
| 181 | 1.9556 | 1.6259 | 2.3874 | 1.6536 | 1.3591 | 2.2641 | 2.3685 | 2.2649 | 0.8543 | 4.5909 | 58.9437 | 3.5868 |
| 182 | 2.1649 | 1.7314 | 2.5992 | 1.8555 | 1.3729 | 2.9086 | 2.5914 | 2.2817 | 1.1442 | 4.5666 | 88.2092 | 2.8255 |
183 rows × 12 columns
##################################
# Conducting standardization
# to transform the values of the
# variables into comparable scale
##################################
standardization_scaler = StandardScaler()
cancer_death_rate_transformed_numeric_array = standardization_scaler.fit_transform(cancer_death_rate_transformed_numeric)
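StandardScaler centers every column at zero with unit (population) variance, which can be verified on a small frame (a minimal sketch):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                   'B': [10.0, 20.0, 30.0, 40.0]})

scaled = StandardScaler().fit_transform(df)

# Each column now has mean ~0 and standard deviation ~1
print(np.round(scaled.mean(axis=0), 6), np.round(scaled.std(axis=0), 6))
```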
##################################
# Formulating a new dataset object
# for the scaled data
##################################
cancer_death_rate_scaled_numeric = pd.DataFrame(cancer_death_rate_transformed_numeric_array,
columns=cancer_death_rate_transformed_numeric.columns)
##################################
# Formulating the individual boxplots
# for all scaled numeric columns
##################################
for column in cancer_death_rate_scaled_numeric:
plt.figure(figsize=(17,1))
sns.boxplot(data=cancer_death_rate_scaled_numeric, x=column)
##################################
# Consolidating both numeric columns
# and geolocation data
##################################
cancer_death_rate_preprocessed = pd.concat([cancer_death_rate_scaled_numeric,cancer_death_rate_cleaned_numeric_geolocation], axis=1, join='inner')
##################################
# Performing a general exploration of the consolidated dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate_preprocessed.shape)
Dataset Dimensions:
(183, 14)
##################################
# Segregating the target
# and descriptor variable lists
##################################
cancer_death_rate_preprocessed_target_SMPREV = ['SMPREV']
cancer_death_rate_preprocessed_target_OWPREV = ['OWPREV']
cancer_death_rate_preprocessed_target_ACSHAR = ['ACSHAR']
cancer_death_rate_preprocessed_descriptors = cancer_death_rate_preprocessed.drop(['SMPREV','OWPREV','ACSHAR','GEOLAT','GEOLON'], axis=1).columns
##################################
# Segregating the SMPREV target
# and descriptor variable names
##################################
y_variable = 'SMPREV'
x_variables = cancer_death_rate_preprocessed_descriptors
##################################
# Defining the number of
# rows and columns for the subplots
##################################
num_rows = 3
num_cols = 3
##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 15))
##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()
##################################
# Formulating the individual scatterplots
# for all scaled numeric columns
##################################
for i, x_variable in enumerate(x_variables):
ax = axes[i]
ax.scatter(cancer_death_rate_preprocessed[x_variable],cancer_death_rate_preprocessed[y_variable])
ax.set_title(f'{y_variable} Versus {x_variable}')
ax.set_xlabel(x_variable)
ax.set_ylabel(y_variable)
##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()
##################################
# Presenting the subplots
##################################
plt.show()
##################################
# Segregating the OWPREV target
# and descriptor variable names
##################################
y_variable = 'OWPREV'
x_variables = cancer_death_rate_preprocessed_descriptors
##################################
# Defining the number of
# rows and columns for the subplots
##################################
num_rows = 3
num_cols = 3
##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 15))
##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()
##################################
# Formulating the individual scatterplots
# for all scaled numeric columns
##################################
for i, x_variable in enumerate(x_variables):
ax = axes[i]
ax.scatter(cancer_death_rate_preprocessed[x_variable],cancer_death_rate_preprocessed[y_variable])
ax.set_title(f'{y_variable} Versus {x_variable}')
ax.set_xlabel(x_variable)
ax.set_ylabel(y_variable)
##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()
##################################
# Presenting the subplots
##################################
plt.show()
##################################
# Segregating the ACSHAR target
# and descriptor variable names
##################################
y_variable = 'ACSHAR'
x_variables = cancer_death_rate_preprocessed_descriptors
##################################
# Defining the number of
# rows and columns for the subplots
##################################
num_rows = 3
num_cols = 3
##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 15))
##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()
##################################
# Formulating the individual scatterplots
# for all scaled numeric columns
##################################
for i, x_variable in enumerate(x_variables):
ax = axes[i]
ax.scatter(cancer_death_rate_preprocessed[x_variable],cancer_death_rate_preprocessed[y_variable])
ax.set_title(f'{y_variable} Versus {x_variable}')
ax.set_xlabel(x_variable)
ax.set_ylabel(y_variable)
##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()
##################################
# Presenting the subplots
##################################
plt.show()
##################################
# Computing the correlation coefficients
# and correlation p-values
# between the SMPREV target descriptor
# and numeric descriptor columns
##################################
cancer_death_rate_preprocessed_numeric_correlation_target = {}
cancer_death_rate_preprocessed_numeric = cancer_death_rate_preprocessed.drop(['OWPREV','ACSHAR','GEOLAT','GEOLON'], axis=1)
cancer_death_rate_preprocessed_numeric_columns = cancer_death_rate_preprocessed_numeric.columns.tolist()
for numeric_column in cancer_death_rate_preprocessed_numeric_columns:
cancer_death_rate_preprocessed_numeric_correlation_target['SMPREV_' + numeric_column] = stats.pearsonr(
cancer_death_rate_preprocessed_numeric.loc[:, 'SMPREV'],
cancer_death_rate_preprocessed_numeric.loc[:, numeric_column])
##################################
# Formulating the pairwise correlation summary
# between the target descriptor
# and numeric descriptor columns
##################################
cancer_death_rate_preprocessed_numeric_summary = pd.DataFrame.from_dict(cancer_death_rate_preprocessed_numeric_correlation_target, orient='index')
cancer_death_rate_preprocessed_numeric_summary.columns = ['Pearson.Correlation.Coefficient', 'Correlation.PValue']
display(cancer_death_rate_preprocessed_numeric_summary.sort_values(by=['Correlation.PValue'], ascending=True).head(10))
| | Pearson.Correlation.Coefficient | Correlation.PValue |
|---|---|---|
| SMPREV_SMPREV | 1.0000 | 0.0000 |
| SMPREV_LUNCAN | 0.6538 | 0.0000 |
| SMPREV_CERCAN | -0.4866 | 0.0000 |
| SMPREV_PROCAN | -0.4232 | 0.0000 |
| SMPREV_COLCAN | 0.4198 | 0.0000 |
| SMPREV_PANCAN | 0.3604 | 0.0000 |
| SMPREV_ESOCAN | -0.2655 | 0.0003 |
| SMPREV_STOCAN | -0.1196 | 0.1070 |
| SMPREV_LIVCAN | 0.1163 | 0.1171 |
| SMPREV_BRECAN | 0.0566 | 0.4465 |
##################################
# Computing the correlation coefficients
# and correlation p-values
# between the OWPREV target descriptor
# and numeric descriptor columns
##################################
cancer_death_rate_preprocessed_numeric_correlation_target = {}
cancer_death_rate_preprocessed_numeric = cancer_death_rate_preprocessed.drop(['SMPREV','ACSHAR','GEOLAT','GEOLON'], axis=1)
cancer_death_rate_preprocessed_numeric_columns = cancer_death_rate_preprocessed_numeric.columns.tolist()
for numeric_column in cancer_death_rate_preprocessed_numeric_columns:
cancer_death_rate_preprocessed_numeric_correlation_target['OWPREV_' + numeric_column] = stats.pearsonr(
cancer_death_rate_preprocessed_numeric.loc[:, 'OWPREV'],
cancer_death_rate_preprocessed_numeric.loc[:, numeric_column])
##################################
# Formulating the pairwise correlation summary
# between the target descriptor
# and numeric descriptor columns
##################################
cancer_death_rate_preprocessed_numeric_summary = pd.DataFrame.from_dict(cancer_death_rate_preprocessed_numeric_correlation_target, orient='index')
cancer_death_rate_preprocessed_numeric_summary.columns = ['Pearson.Correlation.Coefficient', 'Correlation.PValue']
display(cancer_death_rate_preprocessed_numeric_summary.sort_values(by=['Correlation.PValue'], ascending=True).head(10))
| | Pearson.Correlation.Coefficient | Correlation.PValue |
|---|---|---|
| OWPREV_OWPREV | 1.0000 | 0.0000 |
| OWPREV_PANCAN | 0.5360 | 0.0000 |
| OWPREV_CERCAN | -0.4677 | 0.0000 |
| OWPREV_ESOCAN | -0.4489 | 0.0000 |
| OWPREV_LUNCAN | 0.4445 | 0.0000 |
| OWPREV_COLCAN | 0.4442 | 0.0000 |
| OWPREV_STOCAN | -0.1189 | 0.1088 |
| OWPREV_BRECAN | 0.0490 | 0.5105 |
| OWPREV_PROCAN | 0.0280 | 0.7072 |
| OWPREV_LIVCAN | -0.0214 | 0.7737 |
##################################
# Computing the correlation coefficients
# and correlation p-values
# between the ACSHAR target descriptor
# and numeric descriptor columns
##################################
cancer_death_rate_preprocessed_numeric_correlation_target = {}
cancer_death_rate_preprocessed_numeric = cancer_death_rate_preprocessed.drop(['SMPREV','OWPREV','GEOLAT','GEOLON'], axis=1)
cancer_death_rate_preprocessed_numeric_columns = cancer_death_rate_preprocessed_numeric.columns.tolist()
for numeric_column in cancer_death_rate_preprocessed_numeric_columns:
cancer_death_rate_preprocessed_numeric_correlation_target['ACSHAR_' + numeric_column] = stats.pearsonr(
cancer_death_rate_preprocessed_numeric.loc[:, 'ACSHAR'],
cancer_death_rate_preprocessed_numeric.loc[:, numeric_column])
##################################
# Formulating the pairwise correlation summary
# between the target descriptor
# and numeric descriptor columns
##################################
cancer_death_rate_preprocessed_numeric_summary = pd.DataFrame.from_dict(cancer_death_rate_preprocessed_numeric_correlation_target, orient='index')
cancer_death_rate_preprocessed_numeric_summary.columns = ['Pearson.Correlation.Coefficient', 'Correlation.PValue']
display(cancer_death_rate_preprocessed_numeric_summary.sort_values(by=['Correlation.PValue'], ascending=True).head(10))
| | Pearson.Correlation.Coefficient | Correlation.PValue |
|---|---|---|
| ACSHAR_ACSHAR | 1.0000 | 0.0000 |
| ACSHAR_COLCAN | 0.6039 | 0.0000 |
| ACSHAR_PANCAN | 0.5929 | 0.0000 |
| ACSHAR_LUNCAN | 0.4403 | 0.0000 |
| ACSHAR_PROCAN | 0.2083 | 0.0047 |
| ACSHAR_BRECAN | 0.1759 | 0.0172 |
| ACSHAR_CERCAN | -0.1347 | 0.0690 |
| ACSHAR_STOCAN | -0.1249 | 0.0921 |
| ACSHAR_ESOCAN | 0.0732 | 0.3248 |
| ACSHAR_LIVCAN | -0.0709 | 0.3401 |
##################################
# Consolidating relevant numeric columns
# after hypothesis testing
##################################
cancer_death_rate_premodelling = cancer_death_rate_preprocessed.drop(['GEOLAT','GEOLON'], axis=1)
##################################
# Performing a general exploration of the premodelling dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate_premodelling.shape)
Dataset Dimensions:
(183, 12)
##################################
# Listing the column names and data types
##################################
print('Column Names and Data Types:')
display(cancer_death_rate_premodelling.dtypes)
Column Names and Data Types:
PROCAN    float64
BRECAN    float64
CERCAN    float64
STOCAN    float64
ESOCAN    float64
PANCAN    float64
LUNCAN    float64
COLCAN    float64
LIVCAN    float64
SMPREV    float64
OWPREV    float64
ACSHAR    float64
dtype: object
##################################
# Taking a snapshot of the dataset
##################################
cancer_death_rate_premodelling.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | SMPREV | OWPREV | ACSHAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | -0.5405 | -1.4979 | -1.6782 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 0.5329 | 0.6090 | 0.4008 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | -0.6438 | 0.9033 | -1.3345 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 1.1517 | 1.0213 | 1.1371 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | -1.0431 | -1.2574 | 0.3520 |
##################################
# Gathering the pairplot for all variables
##################################
sns.pairplot(cancer_death_rate_premodelling,
kind='reg',
plot_kws={'scatter_kws': {'alpha': 0.3}},)
plt.show()
##################################
# Preparing the clustering dataset
##################################
cancer_death_rate_premodelling_clustering = cancer_death_rate_premodelling.drop(['SMPREV','OWPREV','ACSHAR'], axis=1)
cancer_death_rate_premodelling_clustering.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 |
##################################
# Fitting the K-Means Clustering algorithm
# using a range of K values
##################################
kmeans_cluster_list = list()
kmeans_cluster_inertia = list()
kmeans_cluster_silhouette_score = list()
for cluster_count in range(2,10):
km = KMeans(n_clusters=cluster_count,
random_state=88888888,
n_init='auto',
init='k-means++')
km = km.fit(cancer_death_rate_premodelling_clustering)
kmeans_cluster_list.append(cluster_count)
kmeans_cluster_inertia.append(km.inertia_)
kmeans_cluster_silhouette_score.append(silhouette_score(cancer_death_rate_premodelling_clustering,
km.predict(cancer_death_rate_premodelling_clustering),
metric='euclidean'))
##################################
# Consolidating the model performance metrics
# for the K-Means Clustering algorithm
# using a range of K values
##################################
kmeans_clustering_evaluation_summary = pd.DataFrame(zip(kmeans_cluster_list,
kmeans_cluster_inertia,
kmeans_cluster_silhouette_score),
columns=['KMeans.Cluster.Count',
'KMeans.Cluster.Inertia',
'KMeans.Cluster.Silhouette.Score'])
kmeans_clustering_evaluation_summary
| | KMeans.Cluster.Count | KMeans.Cluster.Inertia | KMeans.Cluster.Silhouette.Score |
|---|---|---|---|
| 0 | 2 | 1238.4894 | 0.2355 |
| 1 | 3 | 1027.3347 | 0.2330 |
| 2 | 4 | 948.1192 | 0.2323 |
| 3 | 5 | 897.3084 | 0.1608 |
| 4 | 6 | 821.6682 | 0.1576 |
| 5 | 7 | 771.4820 | 0.1627 |
| 6 | 8 | 725.5394 | 0.1633 |
| 7 | 9 | 670.6289 | 0.1836 |
##################################
# Plotting the Inertia performance
# by cluster count using a range of K values
# for the K-Means Clustering algorithm
##################################
kmeans_cluster_count_values = np.array(kmeans_clustering_evaluation_summary['KMeans.Cluster.Count'].values)
kmeans_inertia_values = np.array(kmeans_clustering_evaluation_summary['KMeans.Cluster.Inertia'].values)
plt.figure(figsize=(10, 6))
plt.plot(kmeans_cluster_count_values, kmeans_inertia_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(500,1500)
plt.title("K-Means Clustering Algorithm: Cluster Count by Inertia")
plt.xlabel("Cluster")
plt.ylabel("Inertia")
plt.show()
##################################
# Plotting the Silhouette Score performance
# by cluster count using a range of K values
# for the K-Means Clustering algorithm
##################################
kmeans_cluster_count_values = np.array(kmeans_clustering_evaluation_summary['KMeans.Cluster.Count'].values)
kmeans_silhouette_score_values = np.array(kmeans_clustering_evaluation_summary['KMeans.Cluster.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(kmeans_cluster_count_values, kmeans_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,1)
plt.title("K-Means Clustering Algorithm: Cluster Count by Silhouette Score")
plt.xlabel("Cluster")
plt.ylabel("Silhouette Score")
plt.show()
##################################
# Formulating the final K-Means Clustering model
# using the optimal cluster count
##################################
kmeans_clustering = KMeans(n_clusters=2,
random_state=88888888,
n_init='auto',
init='k-means++')
kmeans_clustering = kmeans_clustering.fit(cancer_death_rate_premodelling_clustering)
##################################
# Gathering the Inertia and Silhouette Score
# for the final K-Means Clustering model
##################################
kmeans_clustering_inertia = kmeans_clustering.inertia_
kmeans_clustering_silhouette_score = silhouette_score(cancer_death_rate_premodelling_clustering,
kmeans_clustering.predict(cancer_death_rate_premodelling_clustering),
metric='euclidean')
##################################
# Plotting the cluster labels
# for the final K-Means Clustering model
##################################
cancer_death_rate_kmeans_clustering = cancer_death_rate_premodelling_clustering.copy()
cancer_death_rate_kmeans_clustering['KMEANS_CLUSTER'] = kmeans_clustering.predict(cancer_death_rate_kmeans_clustering)
cancer_death_rate_kmeans_clustering.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | KMEANS_CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | 1 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 0 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | 0 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 0 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | 1 |
##################################
# Gathering the pairplot for all variables
# labelled using the final K-Means Clustering model
##################################
cancer_death_rate_kmeans_clustering_plot = sns.pairplot(cancer_death_rate_kmeans_clustering,
kind='reg',
markers=["o", "s"],
plot_kws={'scatter_kws': {'alpha': 0.3}},
hue='KMEANS_CLUSTER');
sns.move_legend(cancer_death_rate_kmeans_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='KMEANS_CLUSTER', frameon=False)
plt.show()
##################################
# Fitting the Bisecting K-Means Clustering algorithm
# using a range of K values
##################################
bisecting_kmeans_cluster_list = list()
bisecting_kmeans_cluster_inertia = list()
bisecting_kmeans_cluster_silhouette_score = list()
for cluster_count in range(2,10):
bk = BisectingKMeans(n_clusters=cluster_count,
random_state=88888888,
n_init=1,
init='k-means++')
bk = bk.fit(cancer_death_rate_premodelling_clustering)
bisecting_kmeans_cluster_list.append(cluster_count)
bisecting_kmeans_cluster_inertia.append(bk.inertia_)
bisecting_kmeans_cluster_silhouette_score.append(silhouette_score(cancer_death_rate_premodelling_clustering,
bk.predict(cancer_death_rate_premodelling_clustering),
metric='euclidean'))
##################################
# Consolidating the model performance metrics
# for the Bisecting K-Means Clustering algorithm
# using a range of K values
##################################
bisecting_kmeans_clustering_evaluation_summary = pd.DataFrame(zip(bisecting_kmeans_cluster_list,
bisecting_kmeans_cluster_inertia,
bisecting_kmeans_cluster_silhouette_score),
columns=['Bisecting.KMeans.Cluster.Count',
'Bisecting.KMeans.Cluster.Inertia',
'Bisecting.KMeans.Cluster.Silhouette.Score'])
bisecting_kmeans_clustering_evaluation_summary
| | Bisecting.KMeans.Cluster.Count | Bisecting.KMeans.Cluster.Inertia | Bisecting.KMeans.Cluster.Silhouette.Score |
|---|---|---|---|
| 0 | 2 | 1238.4894 | 0.2355 |
| 1 | 3 | 1080.6399 | 0.2146 |
| 2 | 4 | 955.1301 | 0.1887 |
| 3 | 5 | 891.9650 | 0.1762 |
| 4 | 6 | 843.0145 | 0.1750 |
| 5 | 7 | 798.7791 | 0.1341 |
| 6 | 8 | 758.0470 | 0.1413 |
| 7 | 9 | 714.1712 | 0.1503 |
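The final cluster count was chosen from the silhouette curve, but the inertia elbow in the table above can also be located programmatically. A minimal sketch, assuming a discrete second difference as a crude curvature estimate (the helper name is illustrative, not part of the analysis):

```python
import numpy as np

def elbow_by_second_difference(ks, inertias):
    # Approximate the sharpest bend in the inertia curve by the
    # largest discrete second difference (a crude curvature proxy)
    inertias = np.asarray(inertias, dtype=float)
    d2 = inertias[:-2] - 2 * inertias[1:-1] + inertias[2:]
    return ks[1 + int(np.argmax(d2))]

# Inertia values tabulated above for K = 2..9
ks = list(range(2, 10))
inertias = [1238.4894, 1080.6399, 955.1301, 891.9650,
            843.0145, 798.7791, 758.0470, 714.1712]
print(elbow_by_second_difference(ks, inertias))  # → 4
```

On these values the heuristic flags the bend at K=4; it complements rather than replaces the silhouette criterion used for the final model.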
###################################
# Plotting the Inertia performance
# by cluster count using a range of K values
# for the Bisecting K-Means Clustering algorithm
##################################
bisecting_kmeans_cluster_count_values = np.array(bisecting_kmeans_clustering_evaluation_summary['Bisecting.KMeans.Cluster.Count'].values)
bisecting_kmeans_inertia_values = np.array(bisecting_kmeans_clustering_evaluation_summary['Bisecting.KMeans.Cluster.Inertia'].values)
plt.figure(figsize=(10, 6))
plt.plot(bisecting_kmeans_cluster_count_values, bisecting_kmeans_inertia_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(500,1500)
plt.title("Bisecting K-Means Clustering Algorithm: Cluster Count by Inertia")
plt.xlabel("Cluster")
plt.ylabel("Inertia")
plt.show()
###################################
# Plotting the Silhouette Score performance
# by cluster count using a range of K values
# for the Bisecting K-Means Clustering algorithm
##################################
bisecting_kmeans_cluster_count_values = np.array(bisecting_kmeans_clustering_evaluation_summary['Bisecting.KMeans.Cluster.Count'].values)
bisecting_kmeans_silhouette_score_values = np.array(bisecting_kmeans_clustering_evaluation_summary['Bisecting.KMeans.Cluster.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(bisecting_kmeans_cluster_count_values, bisecting_kmeans_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,1)
plt.title("Bisecting K-Means Clustering Algorithm: Cluster Count by Silhouette Score")
plt.xlabel("Cluster")
plt.ylabel("Silhouette Score")
plt.show()
###################################
# Formulating the final Bisecting K-Means Clustering model
# using the optimal cluster count
##################################
bisecting_kmeans_clustering = BisectingKMeans(n_clusters=2,
random_state=88888888,
n_init=1,
init='k-means++')
bisecting_kmeans_clustering = bisecting_kmeans_clustering.fit(cancer_death_rate_premodelling_clustering)
###################################
# Gathering the Inertia and Silhouette Score
# for the final Bisecting K-Means Clustering model
##################################
bisecting_kmeans_clustering_inertia = bisecting_kmeans_clustering.inertia_
bisecting_kmeans_clustering_silhouette_score = silhouette_score(cancer_death_rate_premodelling_clustering,
bisecting_kmeans_clustering.predict(cancer_death_rate_premodelling_clustering),
metric='euclidean')
##################################
# Plotting the cluster labels
# for the final Bisecting K-Means Clustering model
##################################
cancer_death_rate_bisecting_kmeans_clustering = cancer_death_rate_premodelling_clustering.copy()
cancer_death_rate_bisecting_kmeans_clustering['BISECTING_KMEANS_CLUSTER'] = bisecting_kmeans_clustering.predict(cancer_death_rate_bisecting_kmeans_clustering)
cancer_death_rate_bisecting_kmeans_clustering.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | BISECTING_KMEANS_CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | 1 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 0 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | 0 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 0 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | 1 |
##################################
# Generating the pairplot for all variables
# labelled using the final Bisecting K-Means Clustering model
##################################
cancer_death_rate_bisecting_kmeans_clustering_plot = sns.pairplot(cancer_death_rate_bisecting_kmeans_clustering,
kind='reg',
markers=["o", "s"],
plot_kws={'scatter_kws': {'alpha': 0.3}},
hue='BISECTING_KMEANS_CLUSTER');
sns.move_legend(cancer_death_rate_bisecting_kmeans_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='BISECTING_KMEANS_CLUSTER', frameon=False)
plt.show()
##################################
# Fitting the GMM Clustering algorithm
# using a range of K values
##################################
gaussian_mixture_cluster_list = list()
gaussian_mixture_cluster_silhouette_score = list()
for cluster_count in range(2,10):
    gm = GaussianMixture(n_components=cluster_count,
                         init_params='k-means++',
                         random_state=88888888)
    gm = gm.fit(cancer_death_rate_premodelling_clustering)
    gaussian_mixture_cluster_list.append(cluster_count)
    gaussian_mixture_cluster_silhouette_score.append(silhouette_score(cancer_death_rate_premodelling_clustering,
                                                                      gm.predict(cancer_death_rate_premodelling_clustering),
                                                                      metric='euclidean'))
##################################
# Consolidating the model performance metrics
# for the GMM Clustering algorithm
# using a range of K values
##################################
gaussian_mixture_clustering_evaluation_summary = pd.DataFrame(zip(gaussian_mixture_cluster_list,
gaussian_mixture_cluster_silhouette_score),
columns=['GMM.Cluster.Count',
'GMM.Cluster.Silhouette.Score'])
gaussian_mixture_clustering_evaluation_summary
| | GMM.Cluster.Count | GMM.Cluster.Silhouette.Score |
|---|---|---|
| 0 | 2 | 0.2239 |
| 1 | 3 | 0.2235 |
| 2 | 4 | 0.2026 |
| 3 | 5 | 0.1205 |
| 4 | 6 | 0.1208 |
| 5 | 7 | 0.1266 |
| 6 | 8 | 0.1320 |
| 7 | 9 | 0.1348 |
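Because GaussianMixture is a probabilistic model, scikit-learn also exposes information criteria (`bic`, `aic`) that trade fit against complexity, offering an alternative to the silhouette score for choosing the component count. A hedged sketch on synthetic two-blob data standing in for the scaled death-rate matrix:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(88888888)
# Synthetic stand-in: two well-separated Gaussian blobs in 3 dimensions
X = np.vstack([rng.normal(-2.0, 1.0, (60, 3)),
               rng.normal(2.0, 1.0, (60, 3))])

bic_values = []
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=88888888).fit(X)
    bic_values.append(gm.bic(X))  # lower BIC = better fit/complexity trade-off
best_k = 1 + int(np.argmin(bic_values))
print(best_k)
```

On data this cleanly separated, BIC recovers the two underlying components.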
###################################
# Plotting the Silhouette Score performance
# by cluster count using a range of K values
# for the GMM Clustering algorithm
##################################
gaussian_mixture_cluster_count_values = np.array(gaussian_mixture_clustering_evaluation_summary['GMM.Cluster.Count'].values)
gaussian_mixture_silhouette_score_values = np.array(gaussian_mixture_clustering_evaluation_summary['GMM.Cluster.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(gaussian_mixture_cluster_count_values, gaussian_mixture_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,1)
plt.title("GMM Clustering Algorithm: Cluster Count by Silhouette Score")
plt.xlabel("Cluster")
plt.ylabel("Silhouette Score")
plt.show()
###################################
# Formulating the final GMM Clustering model
# using the optimal cluster count
##################################
gaussian_mixture_clustering = GaussianMixture(n_components=2,
init_params='k-means++',
random_state=88888888)
gaussian_mixture_clustering = gaussian_mixture_clustering.fit(cancer_death_rate_premodelling_clustering)
###################################
# Gathering the Silhouette Score
# for the final GMM Clustering model
##################################
gaussian_mixture_clustering_silhouette_score = silhouette_score(cancer_death_rate_premodelling_clustering,
gaussian_mixture_clustering.predict(cancer_death_rate_premodelling_clustering),
metric='euclidean')
##################################
# Plotting the cluster labels
# for the final GMM Clustering model
##################################
cancer_death_rate_gaussian_mixture_clustering = cancer_death_rate_premodelling_clustering.copy()
cancer_death_rate_gaussian_mixture_clustering['GMM_CLUSTER'] = gaussian_mixture_clustering.predict(cancer_death_rate_gaussian_mixture_clustering)
cancer_death_rate_gaussian_mixture_clustering.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | GMM_CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | 1 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 0 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | 0 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 0 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | 1 |
##################################
# Generating the pairplot for all variables
# labelled using the final GMM Clustering model
##################################
cancer_death_rate_gaussian_mixture_clustering_plot = sns.pairplot(cancer_death_rate_gaussian_mixture_clustering,
kind='reg',
markers=["o", "s"],
plot_kws={'scatter_kws': {'alpha': 0.3}},
hue='GMM_CLUSTER');
sns.move_legend(cancer_death_rate_gaussian_mixture_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='GMM_CLUSTER', frameon=False)
plt.show()
##################################
# Fitting the Agglomerative Clustering algorithm
# using a range of K values
##################################
agglomerative_cluster_list = list()
agglomerative_cluster_silhouette_score = list()
for cluster_count in range(2,10):
    ag = AgglomerativeClustering(n_clusters=cluster_count,
                                 linkage='complete')
    ag = ag.fit(cancer_death_rate_premodelling_clustering)
    agglomerative_cluster_list.append(cluster_count)
    agglomerative_cluster_silhouette_score.append(silhouette_score(cancer_death_rate_premodelling_clustering,
                                                                   ag.labels_,
                                                                   metric='euclidean'))
##################################
# Consolidating the model performance metrics
# for the Agglomerative Clustering algorithm
# using a range of K values
##################################
agglomerative_clustering_evaluation_summary = pd.DataFrame(zip(agglomerative_cluster_list,
agglomerative_cluster_silhouette_score),
columns=['Agglomerative.Cluster.Count',
'Agglomerative.Cluster.Silhouette.Score'])
agglomerative_clustering_evaluation_summary
| | Agglomerative.Cluster.Count | Agglomerative.Cluster.Silhouette.Score |
|---|---|---|
| 0 | 2 | 0.1629 |
| 1 | 3 | 0.1311 |
| 2 | 4 | 0.1127 |
| 3 | 5 | 0.1617 |
| 4 | 6 | 0.2035 |
| 5 | 7 | 0.1995 |
| 6 | 8 | 0.2006 |
| 7 | 9 | 0.1968 |
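The merge tree behind complete-linkage agglomeration can also be inspected directly with SciPy, whose `linkage`/`fcluster` pair mirrors `AgglomerativeClustering(linkage='complete')`. A small sketch on synthetic data (not the study's matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(88888888)
# Two compact 2-D blobs: ten points near (-1,-1), ten near (1,1)
X = np.vstack([rng.normal(-1.0, 0.3, (10, 2)),
               rng.normal(1.0, 0.3, (10, 2))])

Z = linkage(X, method='complete')                # full merge history
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree at 2 clusters
print(sorted(set(labels)))  # → [1, 2]
```

`Z` can also be passed to `scipy.cluster.hierarchy.dendrogram` to visualize at which distances clusters merge, which helps justify a cut height.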
###################################
# Plotting the Silhouette Score performance
# by cluster count using a range of K values
# for the Agglomerative Clustering algorithm
##################################
agglomerative_cluster_count_values = np.array(agglomerative_clustering_evaluation_summary['Agglomerative.Cluster.Count'].values)
agglomerative_silhouette_score_values = np.array(agglomerative_clustering_evaluation_summary['Agglomerative.Cluster.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(agglomerative_cluster_count_values, agglomerative_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,1)
plt.title("Agglomerative Clustering Algorithm: Cluster Count by Silhouette Score")
plt.xlabel("Cluster")
plt.ylabel("Silhouette Score")
plt.show()
###################################
# Formulating the final Agglomerative Clustering model
# using the optimal cluster count
##################################
agglomerative_clustering = AgglomerativeClustering(n_clusters=2,
linkage='complete')
agglomerative_clustering = agglomerative_clustering.fit(cancer_death_rate_premodelling_clustering)
###################################
# Gathering the Silhouette Score
# for the final Agglomerative Clustering model
##################################
agglomerative_clustering_silhouette_score = silhouette_score(cancer_death_rate_premodelling_clustering, agglomerative_clustering.labels_, metric='euclidean')
##################################
# Plotting the cluster labels
# for the final Agglomerative Clustering model
##################################
cancer_death_rate_agglomerative_clustering = cancer_death_rate_premodelling_clustering.copy()
cancer_death_rate_agglomerative_clustering['AGGLOMERATIVE_CLUSTER'] = agglomerative_clustering.labels_
cancer_death_rate_agglomerative_clustering.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | AGGLOMERATIVE_CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | 1 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 1 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | 0 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 1 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | 0 |
##################################
# Generating the pairplot for all variables
# labelled using the final Agglomerative Clustering model
##################################
cancer_death_rate_agglomerative_clustering_plot = sns.pairplot(cancer_death_rate_agglomerative_clustering,
kind='reg',
markers=['o', 's'],
plot_kws={'scatter_kws': {'alpha': 0.3}},
hue='AGGLOMERATIVE_CLUSTER');
sns.move_legend(cancer_death_rate_agglomerative_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='AGGLOMERATIVE_CLUSTER', frameon=False)
plt.show()
##################################
# Fitting the Ward Hierarchical Clustering algorithm
# using a range of K values
##################################
ward_hierarchical_cluster_list = list()
ward_hierarchical_cluster_silhouette_score = list()
for cluster_count in range(2,10):
    wh = AgglomerativeClustering(n_clusters=cluster_count,
                                 linkage='ward')
    wh = wh.fit(cancer_death_rate_premodelling_clustering)
    ward_hierarchical_cluster_list.append(cluster_count)
    ward_hierarchical_cluster_silhouette_score.append(silhouette_score(cancer_death_rate_premodelling_clustering,
                                                                       wh.labels_,
                                                                       metric='euclidean'))
##################################
# Consolidating the model performance metrics
# for the Ward Hierarchical Clustering algorithm
# using a range of K values
##################################
ward_hierarchical_clustering_evaluation_summary = pd.DataFrame(zip(ward_hierarchical_cluster_list,
ward_hierarchical_cluster_silhouette_score),
columns=['Ward.Hierarchical.Cluster.Count',
'Ward.Hierarchical.Cluster.Silhouette.Score'])
ward_hierarchical_clustering_evaluation_summary
| | Ward.Hierarchical.Cluster.Count | Ward.Hierarchical.Cluster.Silhouette.Score |
|---|---|---|
| 0 | 2 | 0.2148 |
| 1 | 3 | 0.1924 |
| 2 | 4 | 0.1840 |
| 3 | 5 | 0.1714 |
| 4 | 6 | 0.1858 |
| 5 | 7 | 0.1803 |
| 6 | 8 | 0.1595 |
| 7 | 9 | 0.1689 |
###################################
# Plotting the Silhouette Score performance
# by cluster count using a range of K values
# for the Ward Hierarchical Clustering algorithm
##################################
ward_hierarchical_cluster_count_values = np.array(ward_hierarchical_clustering_evaluation_summary['Ward.Hierarchical.Cluster.Count'].values)
ward_hierarchical_silhouette_score_values = np.array(ward_hierarchical_clustering_evaluation_summary['Ward.Hierarchical.Cluster.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(ward_hierarchical_cluster_count_values, ward_hierarchical_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,1)
plt.title("Ward Hierarchical Clustering Algorithm: Cluster Count by Silhouette Score")
plt.xlabel("Cluster")
plt.ylabel("Silhouette Score")
plt.show()
###################################
# Formulating the final Ward Hierarchical Clustering model
# using the optimal cluster count
##################################
ward_hierarchical_clustering = AgglomerativeClustering(n_clusters=2,
linkage='ward')
ward_hierarchical_clustering = ward_hierarchical_clustering.fit(cancer_death_rate_premodelling_clustering)
###################################
# Gathering the Silhouette Score
# for the final Ward Hierarchical model
##################################
ward_hierarchical_clustering_silhouette_score = silhouette_score(cancer_death_rate_premodelling_clustering, ward_hierarchical_clustering.labels_, metric='euclidean')
##################################
# Plotting the cluster labels
# for the final Ward Hierarchical Clustering model
##################################
cancer_death_rate_ward_hierarchical_clustering = cancer_death_rate_premodelling_clustering.copy()
cancer_death_rate_ward_hierarchical_clustering['WARD_HIERARCHICAL_CLUSTER'] = ward_hierarchical_clustering.labels_
cancer_death_rate_ward_hierarchical_clustering.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | WARD_HIERARCHICAL_CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | 1 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 1 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | 0 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 1 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | 0 |
##################################
# Generating the pairplot for all variables
# labelled using the final Ward Hierarchical Clustering model
##################################
cancer_death_rate_ward_hierarchical_clustering_plot = sns.pairplot(cancer_death_rate_ward_hierarchical_clustering,
kind='reg',
markers=["o", "s"],
plot_kws={'scatter_kws': {'alpha': 0.3}},
hue='WARD_HIERARCHICAL_CLUSTER');
sns.move_legend(cancer_death_rate_ward_hierarchical_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='WARD_HIERARCHICAL_CLUSTER', frameon=False)
plt.show()
##################################
# Consolidating all the
# model performance measures
##################################
clustering_silhouette_score_list = [kmeans_clustering_silhouette_score,
bisecting_kmeans_clustering_silhouette_score,
gaussian_mixture_clustering_silhouette_score,
agglomerative_clustering_silhouette_score,
ward_hierarchical_clustering_silhouette_score]
clustering_silhouette_algorithm_list = ['kmeans_clustering',
'bisecting_kmeans_clustering',
'gaussian_mixture_clustering',
'agglomerative_clustering',
'ward_hierarchical_clustering']
performance_comparison_silhouette_score = pd.DataFrame(zip(clustering_silhouette_algorithm_list,
clustering_silhouette_score_list),
columns=['Clustering.Algorithm',
'Silhouette.Score'])
print('Consolidated Model Performance: ')
display(performance_comparison_silhouette_score)
Consolidated Model Performance:
| | Clustering.Algorithm | Silhouette.Score |
|---|---|---|
| 0 | kmeans_clustering | 0.2355 |
| 1 | bisecting_kmeans_clustering | 0.2355 |
| 2 | gaussian_mixture_clustering | 0.2239 |
| 3 | agglomerative_clustering | 0.1629 |
| 4 | ward_hierarchical_clustering | 0.1629 |
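For scale: the silhouette coefficient lies in [-1, +1], where values near +1 indicate dense, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest misassigned points. The scores above (roughly 0.16 to 0.24) therefore point to weakly separated structure. A toy check on two tight, well-separated 1-D clusters shows what a strong score looks like:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight clusters on the number line, far apart
X = np.array([[0.0], [0.1], [5.0], [5.1]])
labels = np.array([0, 0, 1, 1])
score = silhouette_score(X, labels, metric='euclidean')
print(round(score, 2))  # → 0.98
```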
##################################
# Plotting all the Silhouette Score
# model performance measures
##################################
performance_comparison_silhouette_score.set_index('Clustering.Algorithm', inplace=True)
performance_comparison_silhouette_score_plot = performance_comparison_silhouette_score.plot.barh(figsize=(10, 6))
performance_comparison_silhouette_score_plot.set_xlim(0.00,1.00)
performance_comparison_silhouette_score_plot.set_title("Model Comparison by Silhouette Score Performance for Number of Clusters=2")
performance_comparison_silhouette_score_plot.set_xlabel("Silhouette Score Performance")
performance_comparison_silhouette_score_plot.set_ylabel("Clustering Model")
performance_comparison_silhouette_score_plot.grid(False)
performance_comparison_silhouette_score_plot.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
for container in performance_comparison_silhouette_score_plot.containers:
    performance_comparison_silhouette_score_plot.bar_label(container, fmt='%.5f', padding=-50, color='white', fontweight='bold')
##################################
# Exploring the selected final model
# using the clustering descriptors
# and K-Means clusters
##################################
cancer_death_rate_kmeans_clustering_descriptor = cancer_death_rate_kmeans_clustering.copy()
cancer_death_rate_kmeans_clustering_descriptor.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | KMEANS_CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | 1 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 0 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | 0 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 0 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | 1 |
##################################
# Generating the pairplot for all variables
# labelled using the final K-Means Clustering model
##################################
cancer_death_rate_kmeans_clustering_descriptor_plot = sns.pairplot(cancer_death_rate_kmeans_clustering_descriptor,
kind='reg',
markers=["o", "s"],
plot_kws={'scatter_kws': {'alpha': 0.3}},
hue='KMEANS_CLUSTER');
sns.move_legend(cancer_death_rate_kmeans_clustering_descriptor_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='KMEANS_CLUSTER', frameon=False)
plt.show()
##################################
# Computing the average descriptors
# for each K-Means Cluster
##################################
cancer_death_rate_kmeans_clustering_descriptor['KMEANS_CLUSTER'] = np.where(cancer_death_rate_kmeans_clustering_descriptor['KMEANS_CLUSTER']== 0,'HIGH_PAN_LUN_COL_LIV_CAN','HIGH_PRO_BRE_CER_STO_ESO_CAN')
cancer_death_rate_kmeans_descriptor_clustered = cancer_death_rate_kmeans_clustering_descriptor.groupby('KMEANS_CLUSTER').mean()
display(cancer_death_rate_kmeans_descriptor_clustered)
| KMEANS_CLUSTER | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN |
|---|---|---|---|---|---|---|---|---|---|
| HIGH_PAN_LUN_COL_LIV_CAN | -0.4004 | -0.0894 | -0.7876 | -0.4930 | -0.4541 | 0.6040 | 0.7054 | 0.6445 | 0.0465 |
| HIGH_PRO_BRE_CER_STO_ESO_CAN | 0.3550 | 0.0793 | 0.6983 | 0.4371 | 0.4026 | -0.5355 | -0.6254 | -0.5714 | -0.0413 |
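The two-way `np.where` relabelling used above stops scaling once more than two clusters are involved; `pandas.Series.map` with an explicit dictionary generalizes it to any cluster count. A sketch with the same two labels (the dictionary restates the mapping already applied above):

```python
import pandas as pd

label_map = {0: 'HIGH_PAN_LUN_COL_LIV_CAN',
             1: 'HIGH_PRO_BRE_CER_STO_ESO_CAN'}
clusters = pd.Series([0, 1, 1, 0])
print(clusters.map(label_map).tolist())
```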
##################################
# Plotting the heatmap of the average
# clustering descriptors
# for each K-Means Cluster
##################################
plt.figure(figsize=(10, 8))
sns.heatmap(cancer_death_rate_kmeans_descriptor_clustered, annot=True, cmap="seismic")
plt.xlabel('Cancer Types')
plt.ylabel('K-Means Clusters')
plt.title('Heatmap of Death Rates by Cancer Type and K-Means Clusters')
plt.show()
##################################
# Exploring the selected final model
# using the target descriptors
# and K-Means clusters
##################################
cancer_death_rate_kmeans_clustering_target = pd.concat([cancer_death_rate_kmeans_clustering[['KMEANS_CLUSTER']],cancer_death_rate_preprocessed[['SMPREV','OWPREV','ACSHAR']]], axis=1, join='inner')
cancer_death_rate_kmeans_clustering_target['KMEANS_CLUSTER'] = np.where(cancer_death_rate_kmeans_clustering_target['KMEANS_CLUSTER']== 0,'HIGH_PAN_LUN_COL_LIV_CAN','HIGH_PRO_BRE_CER_STO_ESO_CAN')
cancer_death_rate_kmeans_clustering_target.head()
| | KMEANS_CLUSTER | SMPREV | OWPREV | ACSHAR |
|---|---|---|---|---|
| 0 | HIGH_PRO_BRE_CER_STO_ESO_CAN | -0.5405 | -1.4979 | -1.6782 |
| 1 | HIGH_PAN_LUN_COL_LIV_CAN | 0.5329 | 0.6090 | 0.4008 |
| 2 | HIGH_PAN_LUN_COL_LIV_CAN | -0.6438 | 0.9033 | -1.3345 |
| 3 | HIGH_PAN_LUN_COL_LIV_CAN | 1.1517 | 1.0213 | 1.1371 |
| 4 | HIGH_PRO_BRE_CER_STO_ESO_CAN | -1.0431 | -1.2574 | 0.3520 |
##################################
# Computing the average target descriptors
# for each K-Means Cluster
##################################
cancer_death_rate_kmeans_target_clustered = cancer_death_rate_kmeans_clustering_target.groupby('KMEANS_CLUSTER').mean()
display(cancer_death_rate_kmeans_target_clustered)
| KMEANS_CLUSTER | SMPREV | OWPREV | ACSHAR |
|---|---|---|---|
| HIGH_PAN_LUN_COL_LIV_CAN | 0.6433 | 0.4329 | 0.3218 |
| HIGH_PRO_BRE_CER_STO_ESO_CAN | -0.5704 | -0.3838 | -0.2853 |
##################################
# Plotting the heatmap of the average
# target descriptors
# for each K-Means Cluster
##################################
plt.figure(figsize=(10, 8))
sns.heatmap(cancer_death_rate_kmeans_target_clustered, annot=True, cmap="seismic")
plt.xlabel('Lifestyle Factors')
plt.ylabel('K-Means Clusters')
plt.title('Heatmap of Lifestyle Factors and K-Means Clusters')
plt.show()
##################################
# Exploring the selected final model
# using the location data
# and K-Means clusters
##################################
cancer_death_rate_kmeans_cluster_map = pd.concat([cancer_death_rate_kmeans_clustering_target[['KMEANS_CLUSTER']],cancer_death_rate_filtered_row[['CODE']]], axis=1, join='inner')
cancer_death_rate_kmeans_cluster_map.head()
| | KMEANS_CLUSTER | CODE |
|---|---|---|
| 0 | HIGH_PRO_BRE_CER_STO_ESO_CAN | AFG |
| 1 | HIGH_PAN_LUN_COL_LIV_CAN | ALB |
| 2 | HIGH_PAN_LUN_COL_LIV_CAN | DZA |
| 3 | HIGH_PAN_LUN_COL_LIV_CAN | AND |
| 4 | HIGH_PRO_BRE_CER_STO_ESO_CAN | AGO |
##################################
# Loading the world map GeoJSON file
# obtained from https://geojson-maps.ash.ms/
##################################
world = gpd.read_file('custom.geo.json')
##################################
# Merging the GeoDataFrame
# with world map using country codes
##################################
world_cluster = world.merge(cancer_death_rate_kmeans_cluster_map, left_on='gu_a3', right_on='CODE', how='left')
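A left join silently leaves unmatched geometries (territories absent from the cluster table) with NaN cluster labels, which then render without a colour on the map. pandas' `indicator=True` flag makes such gaps auditable before plotting. A sketch with hypothetical stand-in frames (the country codes are illustrative):

```python
import pandas as pd

# Stand-ins for the world map attributes and the cluster assignments
world_codes = pd.DataFrame({'gu_a3': ['AFG', 'ALB', 'ATA']})   # ATA has no data
clusters = pd.DataFrame({'CODE': ['AFG', 'ALB'],
                         'KMEANS_CLUSTER': [1, 0]})

merged = world_codes.merge(clusters, left_on='gu_a3', right_on='CODE',
                           how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'gu_a3'].tolist()
print(unmatched)  # → ['ATA']
```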
##################################
# Plotting the map by K-Means cluster
##################################
fig, ax = plt.subplots(1, 1, figsize=(10, 6))
world_cluster.boundary.plot(ax=ax, linewidth=1)
world_cluster.plot(column='KMEANS_CLUSTER', cmap="seismic", legend=True, ax=ax, legend_kwds={"loc": "center left", "bbox_to_anchor": (1, 0.5)})
plt.title('KMEANS_CLUSTER')
plt.show()
##################################
# Plotting the map by K-Means descriptors
##################################
cancer_death_rate_kmeans_descriptor_map = pd.concat([cancer_death_rate_kmeans_clustering_descriptor,cancer_death_rate_filtered_row[['CODE']]], axis=1, join='inner')
cancer_death_rate_kmeans_descriptor_map.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | KMEANS_CLUSTER | CODE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | HIGH_PRO_BRE_CER_STO_ESO_CAN | AFG |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | HIGH_PAN_LUN_COL_LIV_CAN | ALB |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | HIGH_PAN_LUN_COL_LIV_CAN | DZA |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | HIGH_PAN_LUN_COL_LIV_CAN | AND |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | HIGH_PRO_BRE_CER_STO_ESO_CAN | AGO |
##################################
# Merging the GeoDataFrame
# with world map using country codes
##################################
world_descriptor = world.merge(cancer_death_rate_kmeans_descriptor_map, left_on='gu_a3', right_on='CODE', how='left')
##################################
# Plotting the map by Pancreatic Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='PANCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "PANCAN"})
plt.title('PANCAN')
plt.show()
##################################
# Plotting the map by Lung Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='LUNCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "LUNCAN"})
plt.title('LUNCAN')
plt.show()
##################################
# Plotting the map by Colon Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='COLCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "COLCAN"})
plt.title('COLCAN')
plt.show()
##################################
# Plotting the map by Liver Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='LIVCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "LIVCAN"})
plt.title('LIVCAN')
plt.show()
##################################
# Plotting the map by Prostate Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='PROCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "PROCAN"})
plt.title('PROCAN')
plt.show()
##################################
# Plotting the map by Breast Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='BRECAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "BRECAN"})
plt.title('BRECAN')
plt.show()
##################################
# Plotting the map by Cervical Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='CERCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "CERCAN"})
plt.title('CERCAN')
plt.show()
##################################
# Plotting the map by Stomach Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='STOCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "STOCAN"})
plt.title('STOCAN')
plt.show()
##################################
# Plotting the map by Esophageal Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='ESOCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "ESOCAN"})
plt.title('ESOCAN')
plt.show()
A detailed report was formulated documenting all the analysis steps and findings.
from IPython.display import display, HTML
display(HTML("<style>.rendered_html { font-size: 15px; font-family: 'Trebuchet MS'; }</style>"))